Pictured above is a car crash on the Bay Bridge, in Maryland.
In order to demonstrate the "Data Science Pipeline", I am using the State of Maryland's vehicular accident/crash data. It should come as no surprise that driving is one of the most dangerous forms of transportation. It's estimated that 2 out of 3 drivers will get into an injury accident in their lifetime (i.e. a vehicular accident that causes physical injury). Aggressive driving and speeding are to blame for most car accidents. Understanding and promoting awareness of the real dangers of driving is incredibly important. Driving is something many of us take for granted, but it can become very dangerous very quickly.
Sources: https://www.drive-safely.net/driving-statistics/ (facts) & https://www.nbcnews.com/news/us-news/heroic-car-crash-witness-saves-toddler-who-was-ejected-maryland-n1266212 (image)
After looking through many data sets from https://opendata.maryland.gov/, I eventually settled on Maryland Statewide Vehicle Crashes. It provided me with data ranging from January 2015 to December 2020, from all the counties in Maryland. Not only is it relatively recent, but it provided me with a lot to work with. This website has data sets for just about everything related to Maryland, though! I got my data from https://opendata.maryland.gov/Public-Safety/Maryland-Statewide-Vehicle-Crashes/65du-s3qu. There is a handy link to download the csv file, and the website has a visualization feature too!
!pip install folium
import folium
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
#Read comma-separated values into a dataframe (called t).
t = pd.read_csv("https://opendata.maryland.gov/api/views/65du-s3qu/rows.csv?accessType=DOWNLOAD")
t.head()
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3146: DtypeWarning: Columns (34,46) have mixed types. Specify dtype option on import or set low_memory=False.
| YEAR | QUARTER | LIGHT_DESC | LIGHT_CODE | COUNTY_DESC | COUNTY_NO | MUNI_DESC | MUNI_CODE | JUNCTION_DESC | JUNCTION_CODE | ... | FEET_MILES_FLAG_DESC | FEET_MILES_FLAG | DISTANCE_DIR_FLAG | REFERENCE_NO | REFERENCE_TYPE_CODE | REFERENCE_SUFFIX | REFERENCE_ROAD_NAME | LATITUDE | LONGITUDE | LOCATION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Q2 | Daylight | 1.00 | Baltimore | 3.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.277263 | -76.503693 | POINT (-76.5036932 39.27726285) |
| 1 | 2020 | Q2 | NaN | 6.02 | Baltimore City | 24.0 | NaN | NaN | Non Intersection | 1.0 | ... | Miles | M | N | NaN | NaN | NaN | NORTH AVE | 39.311025 | -76.616429 | POINT (-76.616429453205 39.311024794431) |
| 2 | 2020 | Q2 | Daylight | 1.00 | Montgomery | 15.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.140680 | -77.193413 | POINT (-77.193412729561 39.140680249069) |
| 3 | 2017 | Q2 | Daylight | 1.00 | Baltimore City | 24.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.282928 | -76.635215 | POINT (-76.6352150952347 39.2829284750108) |
| 4 | 2020 | Q2 | Daylight | 1.00 | Cecil | 7.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.611028 | -75.951314 | POINT (-75.951314 39.611027833333) |
5 rows × 56 columns
Not only is data (sometimes) read messily into a dataframe, but I certainly did not need all 56 columns that my data set provided. I was able to get rid of most of them, including columns 34 and 46 (which were eliciting the warning above). Most of the columns contained numerical codes for the descriptions, and since the key was not provided, the numbers were virtually useless. I ended up keeping:
#data tidying (keeping the columns that I want)
t = t.filter(['YEAR', 'QUARTER', 'LIGHT_DESC', 'COUNTY_DESC', 'JUNCTION_DESC', 'SURF_COND_DESC', 'REPORT_TYPE', 'ACC_DATE', 'ACC_TIME', 'LATITUDE', 'LONGITUDE'])
t.head()
| YEAR | QUARTER | LIGHT_DESC | COUNTY_DESC | JUNCTION_DESC | SURF_COND_DESC | REPORT_TYPE | ACC_DATE | ACC_TIME | LATITUDE | LONGITUDE | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Q2 | Daylight | Baltimore | NaN | NaN | Property Damage Crash | 20200618 | 15:15:00 | 39.277263 | -76.503693 |
| 1 | 2020 | Q2 | NaN | Baltimore City | Non Intersection | Dry | Injury Crash | 20200430 | 06:39:00 | 39.311025 | -76.616429 |
| 2 | 2020 | Q2 | Daylight | Montgomery | NaN | NaN | Injury Crash | 20200504 | 09:46:00 | 39.140680 | -77.193413 |
| 3 | 2017 | Q2 | Daylight | Baltimore City | NaN | NaN | Injury Crash | 20170507 | 10:39:00 | 39.282928 | -76.635215 |
| 4 | 2020 | Q2 | Daylight | Cecil | NaN | NaN | Property Damage Crash | 20200414 | 17:32:00 | 39.611028 | -75.951314 |
The column names were changed to make them easier to type, to take up less space, and because I did not like them before. They all have the same meanings as before; they are just a little more concise now.
#columns renamed
t.columns = ['Year', 'Qtr', 'L/D', 'County', 'Junc', 'Surface', 'Severity', 'Date', 'Time', 'Lat', 'Long']
t.head()
| Year | Qtr | L/D | County | Junc | Surface | Severity | Date | Time | Lat | Long | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Q2 | Daylight | Baltimore | NaN | NaN | Property Damage Crash | 20200618 | 15:15:00 | 39.277263 | -76.503693 |
| 1 | 2020 | Q2 | NaN | Baltimore City | Non Intersection | Dry | Injury Crash | 20200430 | 06:39:00 | 39.311025 | -76.616429 |
| 2 | 2020 | Q2 | Daylight | Montgomery | NaN | NaN | Injury Crash | 20200504 | 09:46:00 | 39.140680 | -77.193413 |
| 3 | 2017 | Q2 | Daylight | Baltimore City | NaN | NaN | Injury Crash | 20170507 | 10:39:00 | 39.282928 | -76.635215 |
| 4 | 2020 | Q2 | Daylight | Cecil | NaN | NaN | Property Damage Crash | 20200414 | 17:32:00 | 39.611028 | -75.951314 |
After renaming columns, some new columns had to be added (for convenience later). I added a datetime object, using the Date column from the dataframe. I also added a DaysSinceStart counter. I figured it would give me a little more to play around with later on, so I'm not solely relying on the datetime object. I also added a Y/NSevere column. This column holds a 1 if the crash was classified as Injury Crash or Fatal Crash, and a 0 otherwise. And, lastly, I added a %Light column to give this categorical variable a numerical interpretation. I classified Daylight as 1.0, Dawn/Dusk as .5, Dark Lights On as .25, and everything else as 0 (as it occurs in darkness).
I also decided to drop any rows that were missing data. My data set has 666K rows, so I could lose a few without impacting the results. The categorical variables also have no meaningful average to fill in with, so it made more sense to just exclude incomplete rows.
#DateTime column created
t['DateTime'] = pd.to_datetime(t['Date'], format="%Y%m%d")
#DaysSinceStart column created
start = pd.to_datetime("20150101", format="%Y%m%d")
t['DaysSinceStart'] = t['DateTime'] - start
#Y/NSevere column created
t['Y/NSevere']=t['Severity'].apply(lambda x: 1 if x == "Injury Crash" or x == "Fatal Crash" else 0)
#%Light column created
t['%Light']=t['L/D'].apply(lambda x: 1.0 if x == "Daylight" else .5 if x == "Dawn" or x == "Dusk" else
.25 if x == "Dark Lights On" else 0)
#Missing data is excluded
t = t.dropna()
t.head()
| Year | Qtr | L/D | County | Junc | Surface | Severity | Date | Time | Lat | Long | DateTime | DaysSinceStart | Y/NSevere | %Light | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 2020 | Q2 | Daylight | Baltimore City | Intersection | Wet | Property Damage Crash | 20200426 | 15:14:00 | 39.312903 | -76.651472 | 2020-04-26 | 1942 days | 0 | 1.00 |
| 18 | 2020 | Q2 | Daylight | Kent | Non Intersection | Dry | Property Damage Crash | 20200510 | 19:16:00 | 39.349181 | -75.878126 | 2020-05-10 | 1956 days | 0 | 1.00 |
| 21 | 2020 | Q2 | Dark Lights On | Queen Anne's | Non Intersection | Dry | Injury Crash | 20200513 | 01:24:00 | 39.204898 | -76.052318 | 2020-05-13 | 1959 days | 1 | 0.25 |
| 53 | 2020 | Q2 | Dark No Lights | Carroll | Non Intersection | Dry | Property Damage Crash | 20200618 | 23:31:00 | 39.478947 | -77.068709 | 2020-06-18 | 1995 days | 0 | 0.00 |
| 56 | 2020 | Q2 | Daylight | Howard | Non Intersection | Dry | Property Damage Crash | 20200605 | 16:27:00 | 39.222951 | -76.808682 | 2020-06-05 | 1982 days | 0 | 1.00 |
In order to try to identify trends or patterns, the data will be visualized in maps and graphs. My initial prediction is that we will see trends that many people already associate with distracted and/or dangerous driving. I think crashes may happen more often at night, as there is limited visibility.
#yearly variable created by grouping the original dataframe by the years. it was then used to make the line plot.
yearly = t.groupby('Year')
yearly = yearly.size()
yearly.plot(kind='line', title='Crashes by Year', xticks=[2016, 2017, 2018, 2019, 2020], xlabel='Year',
ylabel='Number of Crashes', figsize=(10, 8))
To get started, I decided to look at the yearly trends. After grouping all the data by year and plotting it, it is pretty clear that something happened in 2018. After reaching its peak that year, the number of crashes dropped drastically, by about 25,000, through 2020. I felt the need to break this down more, to see what's really going on. Perhaps crashes peaked as society began to condemn those who drive dangerously and/or distracted?
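One way to drill into that drop would be to split the yearly counts by crash severity. A minimal sketch on a toy frame (the `sample` data and `by_year_sev` name are illustrative stand-ins, not from this notebook; on the real data you would crosstab `t['Year']` against `t['Severity']`):

```python
import pandas as pd

#Toy rows standing in for the real Year/Severity columns.
sample = pd.DataFrame({
    'Year': [2018, 2018, 2018, 2020, 2020],
    'Severity': ['Injury Crash', 'Fatal Crash', 'Property Damage Crash',
                 'Injury Crash', 'Property Damage Crash'],
})

#One row per year, one column per severity class, so we can see whether
#the post-2018 decline hits all severities equally.
by_year_sev = pd.crosstab(sample['Year'], sample['Severity'])
```

Calling `by_year_sev.plot(kind='line')` would then give one trend line per severity class.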
#qtrly variable created by grouping the original dataframe by the quarters. it was then used to make the line plot.
qtrly = t.groupby('Qtr')
qtrly = qtrly.size()
qtrly.plot(kind='line', title='Crashes by Quarter', xlabel='Quarter',
ylabel='Number of Crashes', figsize=(10, 8))
After breaking things down into quarters, the trends continue. The majority of crashes occurred in the fourth quarter (the winter months). Perhaps due to inclement weather? Perhaps due to the many holidays? A further breakdown is needed to understand why. But this is interesting, and it raises some more questions.
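The holiday-vs-weather question could be probed by grouping on months instead of quarters, using the DateTime column built earlier. A hedged sketch on stand-in dates (`t_small` and `monthly` are illustrative names):

```python
import pandas as pd

#Stand-in dates mimicking the DateTime column.
dates = pd.to_datetime(['2019-01-15', '2019-07-04', '2019-12-03', '2019-12-24'])
t_small = pd.DataFrame({'DateTime': dates})

#A monthly grouping separates the holiday-heavy months (Nov/Dec) from the
#icy ones (Jan/Feb), which a quarterly view lumps together.
monthly = t_small.groupby(t_small['DateTime'].dt.month).size()
```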
#timely variable created by grouping the original dataframe by their times (of day). it was then used to make the
#line plot.
timely = t.groupby('Time')
timely = timely.size()
timely.plot(kind='line', title='Crashes by Time of Day', xlabel='Time', ylabel='Number of Crashes', figsize=(10, 8))
Not too surprisingly, the majority of crashes appear to occur in the middle of the day, when the most people are on the roads. There is probably a lot more aggressive driving and speeding when more people are out on the roads. And, as mentioned in the introduction, those are the top two causes of crashes. While I do think that speeding is likely a nighttime problem as well (with things like street racing), it seems far more dangerous when the roads are filled with other drivers. Not only are you at a higher risk of crashing, but you could put others at risk of crashing (either way, the event would contribute to this data set). I'd like to have a look at some other variables, though.
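Since the Time column holds 'HH:MM:SS' strings, the minute-level line plot above is noisy; bucketing by hour would make the midday peak easier to read. A sketch on stand-in values (`per_hour` is an illustrative name):

```python
import pandas as pd

#Stand-in times in the same 'HH:MM:SS' format as the Time column.
times = pd.Series(['15:15:00', '15:45:00', '01:24:00', '23:31:00'])

#Keeping only the hour prefix collapses thousands of distinct timestamps
#into 24 buckets.
hours = times.str.slice(0, 2).astype(int)
per_hour = hours.value_counts().sort_index()
```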
#ld variable created by grouping the original dataframe by their light classification. it was then used to make the
#pie graph.
ld = t.groupby('L/D')
ld = ld.size()
ld.plot(kind='pie', title='Crashes by Amount of Daylight', xlabel='Daylight', ylabel='Daylight', figsize=(10, 8))
This is where things start to get a little confusing. I had originally thought that most car accidents would occur at night (or at least at dusk/dawn). The last line graph (the time-of-day plot) is consistent with this pie chart: the majority of crashes occur in daylight. As stated before, though, that is likely due to the sheer number of people on the road at the same time.
#cond variable created by grouping the original dataframe by the surface conditions of the road. it was then used
#to make the pie graph.
cond = t.groupby('Surface')
cond = cond.size()
cond.plot(kind='pie', title='Crashes based on Road Conditions', xlabel='Surface Conditions', ylabel='Surface Conditions',
figsize=(10, 8))
Even more surprising (to me) than the amount of light is the condition of the roads. For the majority of crashes, the roads were dry. I would have thought that wet conditions would lead to more crashes (and that's what they emphasized in Driver's Ed., too!). It begs the question of whether most of the crash data is externally driven, or whether it was mostly distracted drivers. This is data that I, unfortunately, could not get. There were plenty of data sets on victims of car crashes, the cars involved, the roads they occurred on, etc. However, there was no way to connect the exact cases (no case numbers, some didn't have dates/times, etc.). This theory would be interesting to look into further.
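One caveat worth noting: raw counts favor dry roads partly because most driving happens on dry roads. Comparing the *rate* of severe outcomes per surface condition (using the 0/1 Y/NSevere flag built earlier) partially controls for that exposure imbalance. A sketch on toy rows (`severe_rate` is an illustrative name):

```python
import pandas as pd

#Toy rows: surface condition plus the 0/1 severity flag.
sample = pd.DataFrame({
    'Surface': ['Dry', 'Dry', 'Dry', 'Wet', 'Wet'],
    'Y/NSevere': [0, 0, 1, 1, 1],
})

#The mean of a 0/1 column is the share of severe crashes, so this
#compares conditions on rate rather than raw volume.
severe_rate = sample.groupby('Surface')['Y/NSevere'].mean()
```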
#sev variable created by grouping the original dataframe by the severity of the aftermath. it was then used to make
#the pie graph.
sev = t.groupby('Severity')
sev = sev.size()
sev.plot(kind='pie', title='Crashes based on Severity', ylabel='Severity', figsize=(10, 8))
Luckily, only a small share of crashes from 2015 to 2020 were fatal! The majority appear to have been property damage crashes, which is understandable, as the vehicles were likely damaged after crashing into something.
#junc variable created by grouping the original dataframe by the junction where it occurred. it was then used to
#make the pie graph.
junc = t.groupby('Junc')
junc = junc.size()
junc.plot(kind='pie', title='Crashes based on the Junction', ylabel='Junc', figsize=(10, 8))
Unfortunately, many entries listed 'Not Applicable' for this section. I feel that the junction could be very telling about other contributing factors, and almost a quarter of this data is 'Not Applicable'. Perhaps those crashes occurred in a neighborhood? Or maybe between family members' cars at home? It's impossible to tell. However, given what we do know, Intersection and Intersection Related comprise about the same percentage of the data as Non Intersection. Given that there are not too many railways or crossovers, this makes sense. The categories are a little bit broad, though.
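The "almost a quarter" figure can be read straight off with `value_counts(normalize=True)`, which turns counts into shares. A sketch on stand-in labels (`shares` is an illustrative name):

```python
import pandas as pd

#Stand-in junction labels.
junc_vals = pd.Series(['Non Intersection', 'Intersection', 'Not Applicable',
                       'Intersection Related', 'Not Applicable'])

#normalize=True converts counts into fractions of the whole.
shares = junc_vals.value_counts(normalize=True)
```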
#because there are too many entries to plot on a map, I am randomly choosing 500 to geographically analyze.
sampledata = t.sample(n=500)
#the starting map is (roughly) centered about Maryland
map1 = folium.Map(location=[38.52, -76.61], zoom_start=7)
#for every row in the sample data, look up the county's color; any county
#not in the mapping falls back to pink.
county_colors = {
    "Montgomery": "lightblue",       #Montgomery County
    "Prince George's": "lightgreen", #Prince George's County
    "Baltimore": "purple",           #Baltimore County
    "Baltimore City": "orange",      #Baltimore City
    "Anne Arundel": "cadetblue",     #Anne Arundel County
}
for index, row in sampledata.iterrows():
    color = county_colors.get(row["County"], "pink")
    folium.Marker(location=[row["Lat"], row["Long"]],
                  icon=folium.Icon(color=color, icon="asterisk")).add_to(map1)
map1
#colored the most populated counties, and that's where the most accidents are...
When plotting on a map, it is important to remember not to over-generalize. With too few colors, or too few distinguishing icons, it can be difficult to visualize what's going on (which defeats the whole purpose). That being said, the opposite holds too: too many can also be counterintuitive. I thought that 23 different colors (for the 23 different counties) would be too much, so I decided to color code the five most populated counties in Maryland. They were (in order) Montgomery County, Prince George's County, Baltimore County, Baltimore City, and Anne Arundel County. Technically, Baltimore City was not part of the top five, but I had differentiated all of Baltimore City's surrounding counties and did not want it to get lost in the mix of colors (so I gave it its own color along with the other populated counties). The majority of crashes appear to occur in these heavily populated areas. This makes sense, as more people on the same roads increases the odds of a crash.
#formatting the Time column to be transformed into an integer
t['Time'] = t['Time'].replace({':':''}, regex=True)
#transforming the Time column from a column of strings to a column of ints
t['Time'] = t['Time'].astype(int)
#creating an array of important values to standardize
temp = t[['Year', 'Date', 'Time', 'Y/NSevere', '%Light']].values
#standardizing with sklearn
minmaxsc = preprocessing.MinMaxScaler()
scaledtemp = minmaxsc.fit_transform(temp)
#making a new dataframe for the standardized data (reusing t's index so the
#column assignment below lines up row-for-row; a default RangeIndex would
#misalign with t and fill stdTime with NaN)
stddf = pd.DataFrame(scaledtemp, index=t.index)
#also putting the standardized time into the original dataframe (as a new column)
t['stdTime'] = stddf[2]
#plotting the standardized time along with year
plt.scatter(t['Year'], t['stdTime'])
#linear regression
X = np.array(t['Year'].values).reshape(-1, 1)
y = np.array(t['stdTime'].values).reshape(-1,1)
linreg = linear_model.LinearRegression()
linreg.fit(X, y)
#plotting the linear regression
plt.plot(X, linreg.predict(X))
figure(figsize=(30, 26), dpi=80)
print(linreg.coef_)
print(linreg.intercept_)
[[0.00261252]] [-4.7045636]
Throughout the data analysis, time has been an important recurring factor. The distribution of the data in the plot was not what I expected it to be, so I decided to standardize it to ensure that I was getting consistent results. Standardization makes it a little easier to compare data, too. Ultimately, the standardization allowed me to fit a linear regression model. The regression line is nearly flat, with only a very small slope, which suggests that this data set isn't best suited to a linear regression model alone. Other machine learning tools would allow us to delve deeper into the analysis of this data set, in addition to the linear regression.
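One natural next step (and the reason `train_test_split` was imported up top) would be to score the model on held-out data: a near-zero R² on a test set would confirm that year alone barely explains crash time. A sketch on synthetic data with a deliberately weak linear signal (the variable names and noise level are assumptions, not from this notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#Synthetic stand-in for (Year, stdTime): a tiny slope buried in noise.
rng = np.random.default_rng(0)
X = rng.integers(2015, 2021, size=200).reshape(-1, 1).astype(float)
y = 0.002 * X.ravel() + rng.normal(0.0, 0.3, size=200)

#Fitting on one split and scoring on the other shows how much of the
#held-out variance the line actually explains.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)
```

With a slope this small relative to the noise, the test R² tends to sit near zero, matching the nearly flat line above.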
While there is never one simple answer to issues like this, we can certainly see some related factors. The time and location of the crashes in Maryland are pretty understandable, due to external factors. When there are more people out on the roads, a crash becomes more likely. Even if you are an incredibly safe driver, you cannot always trust others. This is also shown in the daylight/darkness chart: the majority of crashes occur when it is light out. And it makes sense that there are more crashes in more heavily populated areas; Baltimore County and Baltimore City definitely fit, with so much traffic crammed into a (relatively) small area. I was particularly shocked by the road conditions. I really thought that the majority of accidents would happen when the roads are wet and slick, but I stand corrected. This reinforces the importance of reducing distracted driving: if external factors are not causing people to crash, then it is likely the driver. It also makes perfect sense that the majority of crashes involve property damage. Ultimately, even if nobody is hurt, it is likely that the car(s) involved will have some sort of damage.
It would be interesting to see whether these trends continue into 2021 and beyond. Looking at the yearly line graph, it appears that crashes may continue to decrease. A culture change has certainly helped with this, too. It's no longer "cool" to text and drive, and many people have begun to call out their friends and family when they see it. It will be interesting to see if that plays any role in a (possible) reduction in crashes next year.
If you would like more information on car crashes in Maryland, or information on the dangers of driving, these resources may be a good starting point.